layout: true


About this course

In this course, we will provide an introduction to the basic concepts and functionalities of R and go through a prototypical data analysis workflow: import, wrangling, exploration, (basic) analysis, and reporting.

By the end of this course you should…

Note: This is not a statistics workshop. Our focus will be on learning how to use R.


Learning by coding


Prerequisites for this course

.large[ - Working versions of R (>= version 4.0.0) and RStudio on your computer ]


About us

Johannes Breuer

.small[ - Senior researcher in the team Data Augmentation at the GESIS department Survey Data Curation and (co-)leader of the team Research Data & Methods at the Center for Advanced Internet Studies (CAIS)

  • Main areas:

    • digital trace data for social science research
    • data linking (surveys + digital trace data)
  • Ph.D. in Psychology, University of Cologne

  • Previously worked in several research projects investigating the use and effects of digital media (Cologne, Hohenheim, Münster, Tübingen)

  • Other research interests

    • computational methods
    • data management
    • open science

johannes.breuer@gesis.org | [@MattEagle09](https://twitter.com/MattEagle09) | personal website ]


About us

Stefan Jünger

.pull-left[ ]

.pull-right[
  • Postdoctoral researcher in the team Data Augmentation at the GESIS department Survey Data Curation
  • Ph.D. in social sciences, University of Cologne
]

  • Research interests:
    • quantitative methods & Geographic Information Systems (GIS)
    • social inequalities & attitudes towards minorities
    • data management & data privacy
    • reproducible research

.small[ stefan.juenger@gesis.org | [@StefanJuenger](https://twitter.com/StefanJuenger) | https://stefanjuenger.github.io]


Our jouRneys

.small[ Johannes

  • was socialized with SPSS
  • was annoyed with AMOS when learning structural equation modeling (around 2011)
  • decided to learn to use the lavaan package for R instead of MPlus to avoid being dependent on yet another proprietary software package
  • attended an introductory Data analysis with R course at GESIS in 2012
  • only used R for SEM for some time, while still doing everything else (esp. data wrangling) with SPSS
  • finally made the full transition to R when joining GESIS in 2017

Stefan

  • learned statistical ‘programming’ when SPSS was still the major player in town
  • got hooked by R somewhere around 2008 or 2009 because of the plots
  • wrote horrible code and estimated multilevel models that took forever to be estimated
  • switched to R for geospatial data in 2015, wrote his first (bad) R package for geo-stuff
  • tried Python, uses Python occasionally, but is forever in love with R ❤️ ]


Keep calm and carry on learning R

Artwork by Allison Horst


About you

Please try to keep it short (3 to 4 sentences or ~30 secs).


Workshop Structure & Materials

.center[https://github.com/jobreu/r-intro-gesis-2021]


Online format


Course schedule

| Day | Time | Topic |
|---|---|---|
| Monday | 10:30 - 11:30 | Getting Started with R and RStudio |
| Monday | 11:30 - 11:45 | Break |
| Monday | 11:45 - 12:45 | Getting Started with R and RStudio |
| Monday | 12:45 - 13:45 | Lunch Break |
| Monday | 13:45 - 15:00 | Data Import & Export |
| Monday | 15:00 - 15:15 | Break |
| Monday | 15:15 - 16:30 | Data Import & Export |

Course schedule

| Day | Time | Topic |
|---|---|---|
| Tuesday | 10:00 - 11:15 | Data Wrangling - Basics |
| Tuesday | 11:15 - 11:30 | Break |
| Tuesday | 11:30 - 12:45 | Data Wrangling - Basics |
| Tuesday | 12:45 - 13:45 | Lunch Break |
| Tuesday | 13:45 - 15:00 | Data Wrangling - Advanced |
| Tuesday | 15:00 - 15:15 | Break |
| Tuesday | 15:15 - 16:30 | Data Wrangling - Advanced |
| Wednesday | 10:00 - 11:15 | Exploratory Data Analysis |
| Wednesday | 11:15 - 11:30 | Break |
| Wednesday | 11:30 - 12:45 | Exploratory Data Analysis |
| Wednesday | 12:45 - 13:45 | Lunch Break |
| Wednesday | 13:45 - 15:00 | Data Visualization - Part 1 |
| Wednesday | 15:00 - 15:15 | Break |
| Wednesday | 15:15 - 16:30 | Data Visualization - Part 1 |

Course schedule

| Day | Time | Topic |
|---|---|---|
| Thursday | 10:00 - 11:15 | Confirmatory Data Analysis |
| Thursday | 11:15 - 11:30 | Break |
| Thursday | 11:30 - 12:45 | Confirmatory Data Analysis |
| Thursday | 12:45 - 13:45 | Lunch Break |
| Thursday | 13:45 - 15:00 | Data Visualization - Part 2 |
| Thursday | 15:00 - 15:15 | Break |
| Thursday | 15:15 - 16:30 | Data Visualization - Part 2 |
| Friday | 10:00 - 11:15 | Reporting with R Markdown |
| Friday | 11:15 - 11:30 | Break |
| Friday | 11:30 - 12:45 | Reporting with R Markdown |
| Friday | 12:45 - 13:45 | Lunch Break |
| Friday | 13:45 - 15:00 | Advanced Use of R, Outlook, Q&A |
| Friday | 15:00 - 15:15 | Break |
| Friday | 15:15 - 16:30 | Advanced Use of R, Outlook, Q&A |

What is R?

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS (R Project website).

R is free and open-source software (FOSS) and also a programming language. More specifically, it is a free, non-commercial implementation of the [S programming language](https://en.wikipedia.org/wiki/S_(programming_language)) (developed by Bell Laboratories).


A very brief history of R

If you want to know a bit more about the origins and history of R as well as the philosophy behind it, the book R Programming for Data Science by Roger D. Peng provides a good summary. Alternatively, you can also watch this YouTube video in which David Smith talks about Twenty Years of R.


ORigins


Why use R?



Fun with R

You can use R to…



The versatility of R

Some of the things you can do and create with R include…


Installing R

You can download R via the R Project website. The exact installation process depends on your operating system (OS). The R Cookbook provides a detailed explanation of the installation process for Windows, macOS, and Linux/Unix.

If you want or need to update your version of R, you can do this the same way as for the first-time installation. If you use Windows, you can also use the installr package to update R (we will talk about packages in a bit).


Graphical user interface (GUI) for R

R comes with a basic GUI (on Windows you can access it by opening the Rgui.exe file). However, it is quite limited in terms of its functionalities.


Integrated development environments (IDEs) for R

Using an IDE provides several advantages, such as:


RStudio

RStudio is the most widely used IDE for R.[1] In addition to the general advantages of an IDE, it has some specific ones:

.footnote[ [1] There are, of course, other IDEs that can be used with/for R. Another popular option is Visual Studio Code from Microsoft (for which an R extension is available).]


Installing RStudio

You can download the installer for your OS from the RStudio website. The R Cookbook also provides some more details on how to install and start RStudio.

When you open RStudio for the first time it should look like this (only in white instead of black and maybe not with R startup messages in German):


RStudio interface

The R console in RStudio

The console is the interactive input-output window of RStudio. You can enter commands here and press Enter to execute them. Typically, the output of the commands you enter into the console will also be displayed here.

If you see the > in the console, it means that it is ready to receive commands.

If you see a + at the beginning of the console input line, this means that the command is incomplete. A common reason for this is a missing ) or ". In that case, you can either complete the command (and then run it by pressing Enter/Return) or abort it by pressing Esc.

Once you have executed at least one command in the console you can cycle through previous ones using ↑ and ↓ on your keyboard.


R as a calculator

The simplest thing you can do with the R console is to use it as a calculator.

3+2
## [1] 5
2^3
## [1] 8
1/3
## [1] 0.3333333

Note: In the console, you won’t see the ## in the output. The [1] before the result indicates that this is the first output value of the command (more complex commands can have more than one output value).
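
To see why the index in square brackets is useful, here is a small sketch: printing a longer sequence makes R wrap the output across several lines, and each line then starts with the index of its first value. Where exactly the lines break depends on the width of your console.

```r
# A sequence of 30 numbers; the output wraps across several lines,
# and each line starts with the index of its first value
numbers <- 1:30
numbers

# The number of values in the result
length(numbers)
```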


R as a calculator

100^3
## [1] 1e+06
1/2500
## [1] 4e-04

For printing very small and very large numbers, R uses scientific notation. If you want to avoid this, you can use the command options(scipen=999). NB: This setting will only be active for the current session.

options(scipen=999)
100^3
## [1] 1000000
1/2500
## [1] 0.0004
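
If you only want to change how a single number is displayed, rather than changing the setting for the whole session, the base R function format() has a scientific argument. A minimal sketch; note that format() returns a character string, not a number, and that big_number is just an example name.

```r
big_number <- 100^3

# Display one value in fixed notation without touching the session options
format(big_number, scientific = FALSE)

# Restore the default behaviour after having set options(scipen = 999)
options(scipen = 0)
```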

Objects in R

R is an object-oriented programming language. Values are stored in objects via assignment with the <- operator. The simplest example of assignment in R is the assignment of a single value to an object. This value can, e.g., be a single number or a character string.

x <- 10
y <- "This is a character string"

x
## [1] 10
y
## [1] "This is a character string"
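
A short sketch of what you can do after assignment: objects can be used in calculations, inspected with class(), and overwritten. The object names are just examples.

```r
x <- 10
y <- "This is a character string"

# Objects can be used in place of the values they hold
x * 2

# class() tells you what kind of value an object holds
class(x)
class(y)

# Assigning a new value overwrites the old one without a warning
x <- 20
x
```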

R objects in RStudio

Once one or more objects have been assigned values they also appear in the Environment tab in RStudio.


R workspace

The Environment tab in RStudio shows the content of your current working environment (also called workspace) which includes any user-defined objects. The contents of the current environment are stored in the working memory (RAM) of your computer until you exit R (or RStudio).

Note: The fact that R objects are stored in your computer’s RAM can become problematic if you work with “big data”. However, there are solutions for working with larger-than-RAM data in R (such as disk.frame).
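
The workspace can also be inspected and tidied from the console. The following sketch uses the base R functions ls(), object.size(), and rm(); big_vector is a made-up object created only for this illustration.

```r
# Create an example object (one million random numbers)
big_vector <- rnorm(1e6)

# List the objects in the current environment
ls()

# Approximate amount of memory used by a single object
print(object.size(big_vector), units = "Mb")

# Remove the object from the workspace again
rm(big_vector)
exists("big_vector")
```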


R's memory use

In the newest versions of RStudio, the Environment tab includes a small icon that displays the system’s overall memory use (displayed as a pie/donut chart) and the amount of RAM used by R (the number next to that).


R's memory use

From the dropdown menu next to that icon, you can also select Memory Usage Report to get more detailed information about current working memory (RAM) use.


The two workhorses of R: Functions and packages 🐴

If you want to do anything in R, you need to use functions, and functions are provided through packages. We will go through the basics of functions and packages in R in the following.


Functions

Put simply, a function takes an input, does something with it, and produces some sort of output. Functions typically have arguments. In the simplest case, a function only requires an input (a value or object) as a single argument (some functions even require no argument).

sqrt(9)
## [1] 3
x <- 9
sqrt(x)
## [1] 3

The output of a function can, of course, also be assigned to an object.

x <- sqrt(9)
x
## [1] 3

Note: Technically, functions are also objects in R.
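
Because functions are objects, you can create your own with function() and assign it to a name just like any other value. A minimal sketch; add_one is a made-up name used only for illustration.

```r
# Define a function with one argument and assign it to a name
add_one <- function(number) {
  number + 1
}

add_one(9)

# Functions have a class, like other objects
class(add_one)
class(sqrt)
```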


Functions

Most functions in R have more than one argument.

y <- "This is a character string"

# in the character string named y: replace i with X
gsub(pattern = "i", replacement = "X", y)
## [1] "ThXs Xs a character strXng"

Functions

If you want to know how to use a function, you can consult its help file. You can do that via the ? command:

?gsub # ?function_name

In RStudio, this will open a file in the Help tab.


Functions

Functions can have required and optional arguments. Required arguments need to be specified for a function to run, whereas optional arguments have defaults and, hence, do not have to be provided in a function call. You can easily identify required and optional arguments in the Usage section of the help file for a function: If the argument is in the format argument = value it is optional. If only the argument name is provided function(argument_1), this means that this argument is required.
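
You can also look at the arguments and their defaults directly in the console with args(). The second part of this sketch defines a hypothetical function, greet(), to show how a default value makes an argument optional.

```r
# Show the arguments of gsub(); those with "= value" are optional
args(gsub)

# A made-up function: `name` is required, `greeting` has a default
greet <- function(name, greeting = "Hello") {
  paste(greeting, name)
}

greet("R")                   # uses the default greeting
greet("R", greeting = "Hi")  # overrides the default
```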


Functions

Function arguments can be provided in the specified order or by referencing them by name (in which case the order can change). For example, the following two versions of the gsub function call are both valid.

y <- "This is a character string"

gsub("i", "X", y)
## [1] "ThXs Xs a character strXng"
gsub(y, replacement = "X", pattern = "i")
## [1] "ThXs Xs a character strXng"

Typing the argument names is more work but it increases the comprehensibility of your code for human readers.


Functions

If you want to understand the “inner workings” of a function (or maybe use code from existing functions for writing your own functions), you can also print the function body by just running the function name without the parentheses behind it.

gsub
## function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE, 
##     fixed = FALSE, useBytes = FALSE) 
## {
##     if (is.factor(x) && length(levels(x)) < length(x)) {
##         gsub(pattern, replacement, levels(x), ignore.case, perl, 
##             fixed, useBytes)[x]
##     }
##     else {
##         if (!is.character(x)) 
##             x <- as.character(x)
##         .Internal(gsub(as.character(pattern), as.character(replacement), 
##             x, ignore.case, perl, fixed, useBytes))
##     }
## }
## <bytecode: 0x0000024b609c13c8>
## <environment: namespace:base>

class: center, middle

Exercise time 🏋️‍♀️💪🏃🚴

R packages

The key elements of the R universe are its packages. They essentially are collections of functions (and sometimes also datasets) and provide some form of documentation for those.

The basic R system as well as a huge number of additional packages that extend its functionalities are available via The Comprehensive R Archive Network (CRAN).

CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R (CRAN website).


base R

When we talk about base R we typically refer to the set of packages that come with a new installation of R via CRAN.

There also is a package called base included with this but the base R system includes a number of other packages as well: utils, stats, datasets, graphics, grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4.

In addition, a new installation also includes the following “recommended” packages: boot, class, cluster, codetools, foreign, KernSmooth, lattice, mgcv, nlme, rpart, survival, MASS, spatial, nnet, Matrix.
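
A quick way to check which packages fall into these two groups on your own installation is the priority argument of installed.packages(). This is only a sketch; the exact lists can differ slightly between R versions and installations.

```r
# Packages that are part of the base R system
base_pkgs <- rownames(installed.packages(priority = "base"))
base_pkgs

# "Recommended" packages that come with a standard installation
recommended_pkgs <- rownames(installed.packages(priority = "recommended"))
recommended_pkgs
```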


Finding packages

CRAN provides an alphabetically sorted list with all available packages. You can search for your keywords of interest in that list, but that is not the most convenient option.

Two helpful resources for finding R packages:

  • CRAN Task Views provide curated lists of recommended packages for specific tasks/areas/topics

  • METACRAN allows you to search and browse all packages on CRAN

Of course, you can also use your search engine of choice and search for what you want to do plus “R package” (example: “ANOVA R package”), and we will introduce you to many useful packages for various purposes throughout this course.


Installing packages from CRAN in R

Installing packages from CRAN in R is very straightforward.

# Install a package
install.packages("correlation") # single or double quotation marks

# Install multiple packages at once
install.packages(c("correlation", "effectsize"))

R packages are installed in specific directories on your computer. NB: If you have multiple versions of R installed, there are separate directories for each version, with the exception of patch releases: e.g., 4.0.1 and 4.0.2 share the same folder for installed packages, whereas 4.0.2 and 4.1.0 do not. To find where packages are installed on your machine, you can use the following command:

.libPaths()

Loading packages

Once you have installed a package, you need to load it to be able to use the functions (and/or datasets) it contains in your R session.

library(correlation) # no quotation marks needed
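
Two related patterns are common in scripts and are sketched below: installing packages only if they are missing, and calling a single function with the package::function() notation without loading the whole package. The sketch uses two packages that ship with R so that it runs everywhere; in your own script you would replace them with the packages you need (the installation step then requires an internet connection).

```r
# Packages the script depends on (here: two that come with R)
needed <- c("stats", "utils")

# Check which of them can be loaded
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

# Install only what is missing
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Use one function without attaching the package via library()
stats::median(c(1, 5, 9))
```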

Other sources for R packages

While it is the main source, not all packages for R are available via CRAN. Another important source of R packages, especially those that are still in early development, is GitHub. To be able to install packages hosted on GitHub you need to use functions from the devtools or the remotes package (which you need to install first as they do not come with base R). For example, if you want to install the RPG dice roll package that I mentioned before:

.small[

# Option 1
library(devtools)
install_github("Felixmil/rollR") # last part of the GitHub URL (user name + repository name)

# Option 2
library(remotes)
install_github("Felixmil/rollR") # last part of the GitHub URL (user name + repository name)

]

Note: To be able to install packages from GitHub on Windows machines, you will need to install Rtools first.


Packages about packages

There are a few packages that facilitate the installation and loading of R packages (from various sources). Two popular ones are:


Installed packages

You can get information about the packages you have installed on your system with the following function:

installed.packages()
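
installed.packages() returns a large matrix, so in practice you will often want to look at only part of it. A small sketch using base R functions; the object name pkgs is just an example.

```r
# Names and versions of the first few installed packages
pkgs <- installed.packages()
head(pkgs[, c("Package", "Version")])

# Version of one specific package (here: one that ships with R)
packageVersion("stats")

# Version of R itself
R.version.string
```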

Managing packages with the RStudio GUI

You can also use the Packages tab in the RStudio GUI to install, load, update, and uninstall packages. You can load a package by clicking the checkbox on the left side of its name. However, to make sure that you (and others) can reproduce what you have done, you should ideally include the installation and loading of packages as part of your R scripts.


R scripts

While the console is useful for trying things out, you should not use it for your actual data analysis. For this you should use R scripts that allow you to store and document your code. R scripts are similar to syntax files for SPSS or do-files for Stata. R scripts have the file extension .R.

In RStudio, you can create a new script via the menu (File -> New File -> R Script), by clicking the small white sheet icon with the green + symbol and choosing R Script, or through the keyboard shortcut Ctrl + Shift + N (Windows & Linux)/Cmd + Shift + N (Mac). You can open an existing script by clicking on it in the files tab, by clicking the open folder icon, via File -> Open File, or using the keyboard shortcut Ctrl + O (Windows & Linux)/Cmd + O (Mac).


RStudio interface: Scripts

When you open or create a script in RStudio this will be displayed in a fourth pane (which will have multiple tabs if you open/create more than one R script or other types of source files).


Working with R scripts

You can write your code in an R script just like you do in the console.

If you want to execute a single command from your script in RStudio, you can do so by placing your cursor somewhere in the command (or directly after it) and clicking the Run button in the menu or by using the keyboard shortcut Ctrl + Enter (Windows & Linux)/Cmd + Enter (Mac). This also works if you select multiple lines of code/commands.

You can also run all commands in your script by selecting Run all from the dropdown menu next to the Run button or via the keyboard shortcut Ctrl + Alt + R (Windows & Linux)/Cmd + Option + R (Mac).

You can save your script in RStudio via File -> Save or Save As..., by clicking the small blue floppy disk icon, or through the keyboard shortcut Ctrl + S (Windows & Linux)/Cmd + S (Mac).


Commenting R scripts

To properly document your code (for your future self as well as other people who may use your code) it is good practice to use comments. In R scripts, you can create a comment by starting a line with a #.

In RStudio, to comment or uncomment one or more lines in a script you can also select them and use the keyboard shortcut Ctrl + Shift + C (Windows & Linux)/Cmd + Shift + C (Mac).

# this is a comment
library(tidyverse)

Setup and workflows for R and RStudio

In the following slides, we will present some suggestions for adopting a couple of settings and practices that help you develop and implement workflows for R and RStudio that minimize mess and increase reproducibility.

In this session, we will only cover the basics that are necessary for establishing such workflows. If you are interested in some further information on setting up and maintaining your installation of R and RStudio as well as the optimization of workflows, and troubleshooting, you can check out the appendix slides with additional materials that we have created on these subjects.

Note: Most of the recommendations in the following (as well as in the additional materials) are largely based on the freely available online book What They Forgot to Teach You About R.


Working directory

The working directory is where R will look for and save files by default.

You can check your current working directory with the following command:

getwd()

In RStudio, the current working directory is also displayed at the top of the Console tab.

There are two ways in which you can set/change your working directory:

  • using the RStudio GUI
  • using functions

Setting the working directory via the RStudio GUI

The RStudio menu Session -> Set Working Directory provides different options:

  • “To Project Directory”: can be used if you have an .Rproj file (more on that later)

  • “To Source File Location”: sets the working directory to the location where the currently active source file - typically an R script - is stored

  • “To Files Pane Location”: sets the working directory to the directory that is currently visible in the Files tab

  • “Choose Directory”: opens a file browser window that lets you choose a directory

To increase the reproducibility of your work, however, using functions in scripts is generally the better approach.


Setting the working directory using functions

You can set a working directory with the following command (of course, you need to replace the file path with the correct one for your system):

setwd("C:/Users/user/Documents/analysis")
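
A sketch of how getwd() and setwd() work together. It uses the temporary directory returned by tempdir() so that it runs on any system; in your own scripts you would use a real project folder instead.

```r
# Remember the current working directory
old_wd <- getwd()

# Switch to another directory (here: R's temporary directory)
setwd(tempdir())
getwd()

# Switch back to where we started
setwd(old_wd)
getwd()
```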

Interlude: File paths

There are absolute (example: “C:/Users/user/Documents/example.R”) and relative file paths (example: “./r-scripts/example.R”). Relative file paths are relative to the current working directory. Common shorthand options for relative file paths are . for the current (working) directory, .. for one folder level up (parent folder), and ~ for the home directory (which is the default working directory in R).

To facilitate the reuse of your code on other systems (by you or others), it is generally preferable to use relative file paths.
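
The base R function file.path() builds paths from their parts, which saves you from typing the separators by hand. A brief sketch; the folder and file names are taken from the example above and do not need to exist on your machine.

```r
# Build a relative path from its components
script_path <- file.path(".", "r-scripts", "example.R")
script_path

# Check whether the file exists relative to the working directory
file.exists(script_path)

# Extract parts of a path
basename(script_path)
dirname(script_path)
```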

Note: R uses Unix-style file paths with /, while Windows uses \ in file paths. However, doubled backslashes (\\) also work in R. There is a Stack Overflow post discussing several ways of dealing with that. A helpful tool in this context is Path Copy Copy which is an add-on for the Windows file explorer that lets you copy file paths in different formats.
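
As a sketch of the options for Windows paths: inside an R string, a single backslash starts an escape sequence, so a Windows path needs either forward slashes, doubled backslashes, or (from R 4.0.0 on) a raw string. The path is the example path used above; the object names are made up.

```r
# Forward slashes work on all operating systems
path_forward <- "C:/Users/user/Documents/example.R"

# Doubled backslashes: each \\ stands for one literal backslash
path_escaped <- "C:\\Users\\user\\Documents\\example.R"

# Raw strings (R >= 4.0.0) let you paste a Windows path unchanged
path_raw <- r"(C:\Users\user\Documents\example.R)"

# The last two are the same string
identical(path_escaped, path_raw)

# Convert backslashes to forward slashes
gsub("\\", "/", path_escaped, fixed = TRUE)
```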


Special features of RStudio

There are quite a few features of RStudio that can make your life as an R user much easier. We will briefly discuss four of them in the following:[1]

  • RStudio projects

  • Keyboard shortcuts

  • Autocomplete for code

  • Customization options

.footnote[ [1] If you want to discover some more of the benefits of using RStudio, you can check out the appendix slides.]


RStudio projects

RStudio projects are a helpful tool for developing a project-oriented workflow that can enhance reproducibility.

You can create a project via the RStudio menu: File -> New Project. RStudio projects are associated with .Rproj files that contain some specific settings for the project. If you double-click on a .Rproj file, this opens a new instance of RStudio with the working directory and file browser set to the location of that file (the repository/folder for this workshop contains an .Rproj file, if you want to try this out).

Explaining RStudio projects in detail is beyond the scope of this course, but there are good tutorials available, e.g., on the RStudio support site or in the respective chapter in What They Forgot to Teach You About R.


Keyboard shortcuts in RStudio

RStudio offers a wide range of useful keyboard shortcuts. You can access a Keyboard Shortcut Quick Reference in RStudio via Help -> Keyboard Shortcuts Help. There even is a keyboard shortcut for accessing the keyboard shortcuts help (very meta): Alt + Shift + K (Windows & Linux)/Option + Shift + K (Mac).

One RStudio keyboard shortcut that is particularly helpful for writing R code is the one for the assignment operator: Alt + - (Windows & Linux)/Option + - (Mac).


Autocomplete in RStudio

Once you start typing a command in RStudio (in the console or a script), RStudio will make autocomplete suggestions (for functions but also other objects). You can cycle through these suggestions using ↑ and ↓ on your keyboard. If you move your mouse cursor to one of the suggestions, RStudio displays an excerpt from the help file of that function. You can accept a suggestion by selecting it and pressing Tab.


General settings for RStudio

By default, R stores your workspace and command history when closing a session (and also restores the former upon startup). While this can be helpful, this creates files that you probably will not use, and can also be a barrier to adopting reproducible workflows.[1]

To avoid that, there are some general settings in RStudio that you might want to change via Tools -> Global Options -> General.

.footnote[ [1] Again, if you want to know more, have a look at the appendix slides.]


Basic workflow and setup recommendations

  • use R scripts to store your code

  • save/export important output in appropriate file formats (more on that in the following session on Data Import & Export)

  • (try to) use relative file paths in your scripts

  • eventually consider adopting a project-based workflow (using .Rproj files)


Troubleshooting 101

In case you get an error message or if your R session crashes, there are a couple of things you can do/try out:

  • copy the error message into your preferred search engine

  • abort the R process: Session -> Terminate R in the RStudio menu or by clicking the stop sign icon in the upper right corner of the console

  • Restart R (RStudio menu: Session -> Restart R) or RStudio

  • re-install packages

.center[ Source: https://s.unhb.de/DqKxb]


Common sources of errors in your R code

  • typos (e.g., capitalization in package names)

  • missing or unmatched (, ', or " (often at the end of a command)

  • \ instead of / in file paths (e.g., when copied from the Windows explorer)

  • packages not installed or loaded

  • code (chunks) executed in the wrong order

.center[ GIF by Allison Horst]
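
To get a feeling for the messages that such mistakes produce, the sketch below triggers two of them on purpose and captures the message text with tryCatch() instead of stopping. The package and object names are deliberately made up, and the exact wording of the messages depends on your R version and language settings.

```r
# Loading a package that is not installed
msg_pkg <- tryCatch(
  library(thispackagedoesnotexist),
  error = function(e) conditionMessage(e)
)
msg_pkg

# Typo in an object name (R is case-sensitive)
my_value <- 5
msg_typo <- tryCatch(
  My_value + 1,
  error = function(e) conditionMessage(e)
)
msg_typo
```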


class: center, middle

Exercise time 🏋️‍♀️💪🏃🚴

Extracurricular activities